Neural Experience Replay Sampler (NERS)

1 Overview

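Neural Experience Replay Sampler (NERS) is a neural network that learns a sampling score $$\sigma _i$$ for each transition. The RL policy is trained as in Prioritized Experience Replay (PER), except that the inferred score $$\sigma _i$$ is used instead of $$\mid\text{TD}\mid$$. Transition $$i$$ in buffer $$B$$ is sampled with the following probability, where $$\alpha$$ is the prioritization exponent as in PER: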

$p_i = \frac{(\sigma _i) ^{\alpha}}{\sum _{k \in [\mid B\mid ]} (\sigma _k) ^{\alpha}}$
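As a small illustration of the formula above, the following sketch converts scores into sampling probabilities with NumPy. The function name `sampling_probabilities` and the example scores are hypothetical, and the scores are assumed to be strictly positive.

```python
import numpy as np


def sampling_probabilities(sigma, alpha=0.6):
    """Return p_i = sigma_i^alpha / sum_k sigma_k^alpha."""
    sigma = np.asarray(sigma, dtype=np.float64)
    scaled = sigma ** alpha
    return scaled / scaled.sum()


sigma = np.array([0.1, 0.4, 0.5, 1.0])  # hypothetical scores
p = sampling_probabilities(sigma, alpha=0.6)

# Draw a mini-batch of transition indexes according to p
rng = np.random.default_rng(0)
batch = rng.choice(len(sigma), size=2, p=p)
```

With `alpha=0` every transition is sampled uniformly, and larger `alpha` puts more weight on high-score transitions.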

NERS consists of 3 neural networks: a local network $$f_l$$, a global network $$f_g$$, and a score network $$f_s$$. $$f_l$$ computes an output for each transition. $$f_g$$ also computes an output for each transition, and these outputs are then averaged over the mini-batch to form the global context. $$f_s$$ takes the concatenation of the output of $$f_l$$ and the global context, and calculates the score $$\sigma _i$$ for each transition.
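The data flow through the three networks can be sketched as follows. This is only a minimal NumPy forward pass under stated assumptions: the two-layer MLPs, the layer sizes, the random weights, and the sigmoid output are illustrative choices, and the helper names `init_mlp`, `mlp`, and `ners_scores` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(in_dim, hidden, out_dim):
    """Random two-layer MLP parameters (illustrative only)."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }


def mlp(params, x):
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU
    return h @ params["W2"] + params["b2"]


def ners_scores(f_l, f_g, f_s, features):
    """features: (B, feature_dim) -> scores: (B,)"""
    batch_size = features.shape[0]
    local = mlp(f_l, features)                    # (B, d_l), one row per transition
    global_ctx = mlp(f_g, features).mean(axis=0)  # (d_g,), averaged over the mini-batch
    global_ctx = np.broadcast_to(global_ctx, (batch_size, global_ctx.shape[0]))
    logits = mlp(f_s, np.concatenate([local, global_ctx], axis=1))  # (B, 1)
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))    # sigmoid output is an assumption


feature_dim, d_l, d_g = 10, 4, 4
f_l = init_mlp(feature_dim, 16, d_l)
f_g = init_mlp(feature_dim, 16, d_g)
f_s = init_mlp(d_l + d_g, 16, 1)

features = rng.normal(size=(8, feature_dim))  # hypothetical mini-batch of 8 transitions
scores = ners_scores(f_l, f_g, f_s, features)
```

Because the global context is shared by all rows, the score of one transition depends on which other transitions are in the same mini-batch.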

The input of NERS is the following set of features computed from the transitions with indexes $$I$$:

$D(I) = \lbrace s _{\kappa(i)}, a _{\kappa(i)}, r _{\kappa(i)}, s _{\kappa(i)+1}, \kappa(i), \delta _{\kappa(i)}, r_{\kappa(i)} + \gamma \max _a Q _{\hat{\theta}} (s_{\kappa(i)+1},a) \rbrace _{i \in I}$

where $$\kappa(i)$$ is the time step of the $$i$$-th transition, $$\gamma$$ is the discount factor, $$\delta _{\kappa(i)}$$ is the TD error, and $$\hat{\theta}$$ is the target network parameter.

NERS is trained at the end of each episode, using a subset of the transitions that were used to update the actor and critic during that episode.

In order to maximize the replay reward $$r ^{\text{re}} = \sum _{t \in \text{current episode}} r_t - \sum _{t \in \text{previous episode}} r_t$$, the following gradient with respect to the NERS parameters $$\phi$$ is used:

$\nabla _{\phi} \mathbb{E} [r^{\text{re}}] = \mathbb{E}[r^{\text{re}}\sum _{i\in I_{\text{train}}} \nabla _{\phi} \log \sigma _i (D(I_{\text{train}}))]$
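To make the update concrete, here is a toy single-sample estimate of this gradient. It assumes a deliberately simple score model, $$\sigma _i = \text{sigmoid}(\phi \cdot x_i)$$ with feature vector $$x_i$$, so that $$\nabla _{\phi} \log \sigma _i = (1 - \sigma _i) x_i$$ can be written by hand; a real NERS would use the three networks and automatic differentiation. The names `replay_reward` and `score_gradient`, the learning rate, and all numbers are hypothetical.

```python
import numpy as np


def replay_reward(current_rewards, previous_rewards):
    """r_re = cumulative reward of current episode - that of previous episode."""
    return float(np.sum(current_rewards) - np.sum(previous_rewards))


def score_gradient(phi, features, r_re):
    """Single-sample estimate of r_re * sum_i grad_phi log sigma_i."""
    sigma = 1.0 / (1.0 + np.exp(-(features @ phi)))
    grad_log_sigma = (1.0 - sigma)[:, None] * features  # (B, dim)
    return r_re * grad_log_sigma.sum(axis=0)


rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))  # features of the transitions in I_train
phi = np.array([0.1, -0.2, 0.3])    # toy NERS parameters

r_re = replay_reward([1.0, 1.0, 1.0], [1.0, 0.5])
grad = score_gradient(phi, features, r_re)

# Gradient ascent, because the replay reward is maximized
learning_rate = 1e-2
new_phi = phi + learning_rate * grad
```

When the current episode is better than the previous one ($$r ^{\text{re}} > 0$$), the step raises the scores of the transitions that were used; otherwise it lowers them.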

2 With cpprb

You can implement NERS with the current cpprb. For the main replay buffer, you can use PrioritizedReplayBuffer. For collecting the indexes of the transitions used during an episode, you can use ReplayBuffer. Additionally, you need to implement the neural network that learns the sampling score.